Tricks for RNA-seq and other large data sets
Systems Biology Lab
2025-01-01
Making the best of our ignorance of causes
But what is the meaning of these p-values?
We measure concentrations \(a\) and \(b\) of a metabolite in a person under conditions A and B. As the “boring/nothing special” hypothesis we propose the following null hypothesis, or \(H_0\):
More precisely
To be useful in statistical testing we need the following information:
p-value: The probability of measuring the observed value or a more extreme value, when \(H_0\) is true
\[\text{p-value}=P(t \in \left(-\infty,-d\right] \cup \left[d,\infty \right))\]
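As a minimal sketch of this definition, a two-sample t-test on hypothetical concentration measurements (the values below are made up for illustration) returns exactly this two-sided p-value:

```python
# Two-sided p-value for a two-sample t-test on hypothetical
# metabolite concentrations under conditions A and B.
from scipy import stats

a = [5.1, 4.8, 5.3, 5.0, 4.9]  # condition A (made-up values)
b = [5.6, 5.9, 5.4, 5.8, 5.7]  # condition B (made-up values)

# stats.ttest_ind returns the test statistic t and the probability,
# under H0 (equal means), of a statistic at least as extreme:
# P(T <= -|t| or T >= |t|).
t, p = stats.ttest_ind(a, b)
print(t, p)
```

Note that `p` is computed entirely under the assumption that \(H_0\) is true; it says nothing about the probability that \(H_0\) itself is true.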
Make sure that you have the knowledge of statistical analyses to answer the following questions:
The p-value of a statistical test is the probability that the observed or more extreme events happen if \(H_0\) is true.
The \(\alpha\) level, or “p-value cut-off”, is the accepted level of false positive conclusions or Type I errors, i.e. the probability of rejecting \(H_0\) when \(H_0\) is true.
Answer to question 1:
Incorrect. This is a wrong interpretation of the p-value: something essential is missing. The assertion only holds for cases when the null hypothesis is true. But we do not know whether the null hypothesis is true, neither before nor after performing a test! Beware of the internet: you will find wrong statements everywhere (see below). This harms the correct interpretation of statistical tests.
\(\longleftarrow\) No, twice NO!
Exemplified by RNA-seq data
From Wang et al. 2009
Notice variation in the two random samples:
Five rows from the Pasilla data set
| gene_id | untreated1 | untreated2 | untreated3 | untreated4 | treated1 | treated2 | treated3 |
|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... | ... | ... |
| FBgn0020369 | 3387 | 4295 | 1315 | 1853 | 4884 | 2133 | 2165 |
| FBgn0020370 | 3186 | 4305 | 1824 | 2094 | 3525 | 1973 | 2120 |
| FBgn0020371 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| FBgn0020372 | 38 | 84 | 29 | 28 | 63 | 28 | 27 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| TOTAL COUNT | 13972512 | 21911438 | 8358426 | 9841335 | 18670279 | 9571826 | 10343856 |
Notice:
Variation in sequence fragments counted due to technical variation:
These all lead to variation in sequencing depth or total count between samples.
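One simple way to make samples with different sequencing depths comparable is total-count scaling (counts per million). A sketch using two genes and the per-sample totals from the table above:

```python
import numpy as np

# Counts for two genes across the seven samples shown above,
# plus the per-sample total counts (sequencing depth).
counts = np.array([
    [3387, 4295, 1315, 1853, 4884, 2133, 2165],  # FBgn0020369
    [  38,   84,   29,   28,   63,   28,   27],  # FBgn0020372
])
totals = np.array([13972512, 21911438, 8358426, 9841335,
                   18670279, 9571826, 10343856])

# Scale every sample to a common depth of one million fragments:
cpm = counts / totals * 1e6
print(np.round(cpm, 1))
```

After scaling, differences between samples reflect relative abundance rather than total count.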
Simple method
Proposal: the count \(X\) of red fragments in the sample has a Poisson distribution with parameter \(\lambda\):
\[P \left( X = k \right) = \frac{\lambda^k e^{-\lambda}}{k!}\] where \(\lambda = p \times n\) is the expected number of red counts in a sample of \(n\) RNA fragments.
\[\lambda = \mu = \sigma^2 \qquad \rightarrow \sigma=\sqrt{\mu}\]
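The Poisson property \(\mu = \sigma^2\) is easy to verify by simulation (a sketch with an arbitrary \(\lambda\)):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 100                        # arbitrary expected count
x = rng.poisson(lam, size=100_000)

# For a Poisson variable, mean and variance both estimate lambda,
# so the standard deviation is close to sqrt(lambda) = 10.
print(x.mean(), x.var(), x.std())
```

This is exactly the square-root relation \(\sigma = \sqrt{\mu}\) stated above.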
Two options:
Proportional error is the most common instrumental error: \(\sigma = a \cdot \mu\)
The standard deviation is a fixed percentage of the measured value
The data are heteroscedastic
On a logarithmic scale the slope equals 1: \[\log{(\sigma)} = \log{(a)} + \color{red}{1} \cdot \log{(\mu)}\]
\(s \propto \overline{x}^{0.7} \longrightarrow\) transformation: \(y_i = x_i^{0.3}\)
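A quick simulation illustrates why this power works: if \(\sigma \propto \mu^{0.7}\), the delta method says \(y = x^{1-0.7} = x^{0.3}\) gives an approximately constant standard deviation. The means and noise model below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated measurements whose standard deviation grows as mean**0.7.
means = np.array([10.0, 100.0, 1000.0])
sds_raw, sds_trans = [], []
for mu in means:
    x = rng.normal(mu, 0.2 * mu**0.7, size=50_000)
    sds_raw.append(x.std())
    sds_trans.append((x ** 0.3).std())   # power transform y = x**0.3

# On the raw scale the sd grows steeply with the mean; after the
# transform it is approximately constant (delta method: 0.3 * 0.2).
print(np.round(sds_raw, 3), np.round(sds_trans, 4))
```

The raw standard deviations span more than a factor of ten, while the transformed ones agree to within a few percent.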
UMAP on original data and on log-transformed data
Variance-stabilizing transformations for NB-distributed data exist, see Holmes and Huber (2019)
Such a transformation should be used in applications that work directly with the counts, for example to assess the quality of the data:
Comparing gene expression under conditions A and B
| Gene | \(H_0\) |
|---|---|
| \(G_1\) | \(\mu^A_1 = \mu^B_1\) |
| \(G_2\) | \(\mu^A_2 = \mu^B_2\) |
| \(G_3\) | \(\mu^A_3 = \mu^B_3\) |
| \(\vdots\) | \(\vdots\) |
| \(G_n\) | \(\mu^A_n = \mu^B_n\) |
Comparing gene expression under conditions A and B
| Gene | \(H_0\) | p-value | test |
|---|---|---|---|
| \(G_1\) | \(\mu^A_1 = \mu^B_1\) | 0.47 | accept |
| \(G_2\) | \(\mu^A_2 = \mu^B_2\) | 0.0001 | reject |
| \(G_3\) | \(\mu^A_3 = \mu^B_3\) | 0.04 | reject |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(G_n\) | \(\mu^A_n = \mu^B_n\) | 0.74 | accept |
Suppose we measure the expression of 5050 genes
The fraction of false positives in the list of all 300 positive calls equals \(\frac{250}{50+250} = 83\%\)!
Because the fraction of genes for which \(H_0\) is true is very large, namely \(\frac{5000}{5050}=0.99\). Therefore, even though we take \(\alpha=0.05\) we still get a lot of false positives.
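This arithmetic can be replayed with a simulation. Assuming, for illustration, that all 50 truly changed genes give very small p-values (full power), while the 5000 null genes give uniform p-values:

```python
import numpy as np

rng = np.random.default_rng(2)

# 5000 genes for which H0 is true: p-values are uniform on [0, 1].
p_null = rng.uniform(size=5000)
# 50 truly changed genes; assume (for illustration) tiny p-values,
# so all of them are detected.
p_alt = rng.uniform(0, 1e-4, size=50)

alpha = 0.05
false_pos = (p_null < alpha).sum()   # expected: 0.05 * 5000 = 250
true_pos = (p_alt < alpha).sum()     # 50

fdr = false_pos / (false_pos + true_pos)
print(false_pos, true_pos, round(fdr, 2))
```

The fraction of false positives among the positive calls lands close to the 83% computed above, even though each individual test was performed at \(\alpha = 0.05\).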
What would be the fraction of false positives when we chose a more strict \(\alpha=0.01\)?
\[\alpha=\frac{\text{False positives}}{\text{True negatives}+\text{False positives}}\]
\[\text{FDR}=\frac{\text{False positives}}{\text{True positives} + \text{False positives}}\]
By moving the dashed line \(a\) we can change the FDR as desired
Based on a list of significant (positive) genes you might:
Please, browse through the book: other chapters may be relevant for this course and for future reference
Note that the term normalization has different meanings in statistics. You will likely encounter both!
The Negative Binomial distribution \(\mathrm{NB}(X=k)\) with parameters \(r\) and \(p\) can be obtained in different ways. One is the following:
Another way to generate the NB distribution is as an infinite mixture of Poisson distributions, whose parameters \(\lambda\) are themselves drawn from a gamma distribution, see Ch4 in Holmes and Huber (2019). This is why the NB distribution is also known as the gamma-Poisson distribution.
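The gamma-Poisson construction is easy to check by simulation (shape \(r\) and mean \(\mu\) below are arbitrary): drawing each Poisson rate from a gamma distribution inflates the variance from \(\mu\) to \(\mu + \mu^2/r\), the NB mean-variance relation.

```python
import numpy as np

rng = np.random.default_rng(3)

r, mu = 5.0, 50.0            # gamma shape (dispersion) and target mean
n = 200_000

# Draw a Poisson rate lambda from a gamma distribution with shape r
# and mean mu, then draw a count from Poisson(lambda).
lam = rng.gamma(shape=r, scale=mu / r, size=n)
x = rng.poisson(lam)

# The mixture is negative binomial: mean mu, variance mu + mu**2 / r,
# i.e. 50 + 2500/5 = 550 here -- overdispersed relative to Poisson.
print(x.mean(), x.var())
```

A pure Poisson with the same mean would have variance 50; the extra \(\mu^2/r\) term is the biological (between-sample) variation.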
You may encounter the concept local FDR (often indicated by lower case fdr), e.g. in Holmes and Huber.
In this slide about the FDR the local fdr is the ratio of the blue line segment to the total (blue + red) line segment on the line \(a\), whereas the total FDR is the ratio of the blue area to the total area left of the line \(a\). (The areas are the integrals of the line segments.) The local fdr estimates the probability that an individual positive call is a false positive.
You may encounter the concept q-value.
In a list of hypothesis tests ordered by increasing p-values, the q-value of a test is the FDR (expected fraction of false positives) of the collection of hypothesis tests up to and including that test.
You will often encounter the terms “corrected p-value” or “adjusted p-value” in the literature when the false discovery rate is meant.
Do not use these terms: they are confusing and imprecise because the meaning of an FDR is very different from that of a p-value. Make sure that you understand this comment.
See Päll et al. (2023) about the importance of checking the distribution of p-values.
These questions can only be answered conditional on the fact that the null hypothesis is true. But in none of these cases do we know the probability that the null hypothesis is true (see definition of p-value).
The BH algorithm (Benjamini and Hochberg 1995)
Using the BH rejection threshold we get: FDR \(\leq \alpha\)
The uniform distribution generates a straight line with an offset that equals \(\pi_0\)
The key aspect is to have an estimate \(\hat{\pi}_0\) of \(\pi_0\)
\[p_{(k)}=\alpha \cdot q_k\]
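A hand-rolled sketch of the BH rejection rule: sort the p-values, find the largest \(k\) with \(p_{(k)} \le \alpha \cdot k/m\), and reject the \(k\) smallest. The p-values below are made up for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses (BH procedure)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # Compare each sorted p-value to its BH threshold alpha * k / m.
    below = sorted_p <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest index passing
        reject[order[: k + 1]] = True      # reject all up to and incl. k
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74]
print(benjamini_hochberg(pvals, alpha=0.05))
```

Note that a p-value below \(\alpha\) (e.g. 0.039 here) can still fail its BH threshold, because the threshold for the \(k\)-th smallest p-value is only \(\alpha \cdot k/m\).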